Abstract. This paper describes a preconditioned conjugate gradient
method that can be effectively implemented on both vector machines
and parallel arrays to solve sparse symmetric and positive definite
systems of linear equations. The implementation on the CYBER 203/205
and on the Finite Element Machine is discussed and results obtained
using the method on these machines are given.


1. Introduction


In this paper we are concerned with the solution of a sparse
$N \times N$ system of symmetric and positive definite linear equations

$Ku = f$    (1.1)

by preconditioned conjugate gradient (PCG) methods on both vector
computers and parallel arrays. Several descriptions of these methods
appear in the literature; see, for example, Concus, Golub, and O'Leary
[1976] and Chandra [1978]. Also, Schreiber [1978] discussed the
implementation of conjugate gradient (CG) on vector computers, and
Podsiadlo and Jordan [1981] discussed its implementation on the Finite
Element Machine under construction at NASA Langley Research Center.

The PCG method solves the system $\hat{K}\hat{u} = \hat{f}$, where

$\hat{K} = Q^{-1} K Q^{-T}, \quad \hat{u} = Q^{T} u, \quad \hat{f} = Q^{-1} f,$    (1.2)

$Q$ is a nonsingular matrix, and the symmetric and positive definite
preconditioning matrix is given by $M = QQ^{T}$. The algorithm for the
solution of $u$ directly is described in Chandra [1978] and is given
below, where $u$, $r$, $\hat{r}$, and $p$ are vectors and $(x,y)$
denotes the inner product $x^{T}y$.

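(1) Choose $u^0$

(2) $r^0 = f - Ku^0$

(3) Solve $M\hat{r}^0 = r^0$

(4) $p^0 = \hat{r}^0$

(5) For $k = 0, 1, \ldots$

    (1) $\alpha = (\hat{r}^k, r^k) / (p^k, Kp^k)$

    (2) $u^{k+1} = u^k + \alpha p^k$

    (3) If $\|u^{k+1} - u^k\|_\infty < \epsilon$ then stop,
        otherwise continue.

    (4) $r^{k+1} = r^k - \alpha Kp^k$

    (5) Solve $M\hat{r}^{k+1} = r^{k+1}$

    (6) $\beta = (\hat{r}^{k+1}, r^{k+1}) / (\hat{r}^k, r^k)$

    (7) $p^{k+1} = \hat{r}^{k+1} + \beta p^k$

Algorithm 1. PCG Algorithm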


We note that the standard conjugate gradient algorithm results by
choosing $M = I$.
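
As an illustration only, the steps of Algorithm 1 can be sketched in Python with NumPy. The function name `pcg`, the callable `solve_M` (which returns $\hat{r}$ from $M\hat{r} = r$), the guard against an exactly zero residual, and the small test system are assumptions made for this sketch and are not part of the implementations described in this paper.

```python
import numpy as np

def pcg(K, f, solve_M, u0=None, eps=1e-10, max_iter=1000):
    """Sketch of Algorithm 1: solve K u = f; solve_M(r) returns r_hat with M r_hat = r."""
    u = np.zeros_like(f, dtype=float) if u0 is None else u0.astype(float)
    r = f - K @ u                          # step (2)
    r_hat = solve_M(r)                     # step (3)
    p = r_hat.copy()                       # step (4)
    rho = r_hat @ r                        # inner product (r_hat^k, r^k)
    for k in range(max_iter):
        Kp = K @ p
        alpha = rho / (p @ Kp)             # step (5.1)
        step = alpha * p
        u = u + step                       # step (5.2)
        if np.max(np.abs(step)) < eps:     # step (5.3): ||u^{k+1} - u^k||_inf < eps
            return u, k + 1
        r = r - alpha * Kp                 # step (5.4)
        r_hat = solve_M(r)                 # step (5.5)
        rho_new = r_hat @ r
        if rho_new == 0.0:                 # added guard: residual is exactly zero
            return u, k + 1
        beta = rho_new / rho               # step (5.6)
        p = r_hat + beta * p               # step (5.7)
        rho = rho_new
    return u, max_iter

# Hypothetical test system: 1-D Laplacian with M = diag(K).
n = 50
K = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
f = np.sin(np.arange(1, n + 1))
u, iterations = pcg(K, f, lambda r: r / np.diag(K))
```

Passing `solve_M = lambda r: r` gives the standard CG method, since that corresponds to $M = I$.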

For vector machines, if $M = I$, all steps of the iteration loop
except (1) and (6) can be vectorized. In particular, the
multiplication $Kp$, for $K$ sparse, vectorizes after a suitable
ordering of the equations and will be discussed in detail in
Section 3. The difficulty arises in the formation of the inner
products necessary to calculate $\alpha$ and $\beta$. These
calculations require a phase in which N partial sums must be added
together and therefore do not vectorize well.

For parallel arrays like the Finite Element Machine (Jordan [1978],
Adams [1982]), the calculation of $u$, $r$, and $p$ can be distributed
to the individual processors and the necessary communication between
processors can be performed on the dedicated local links. The
convergence test in (3) can be performed by using the flag network.
However, for a large number of processors, the calculations of
$\alpha$ and $\beta$ can be expensive since the number of values to be
summed for each inner product is equal to P, the number of processors.
Jordan [1979] realized that this was potentially detrimental to the
efficiency of the method on this machine, and as a result, a special
hardware circuit (sum/max) was designed to perform the P sums in
$O(\log_2 P)$ time.
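
The logarithmic time can be pictured as a pairwise (tree) summation. The following Python sketch only illustrates that idea; it is not a description of the sum/max circuit, and the function name `tree_sum` is hypothetical.

```python
def tree_sum(partials):
    """Sum one partial result per processor in ceil(log2 P) pairwise stages."""
    values = list(partials)
    stages = 0
    while len(values) > 1:
        # Each stage adds neighbouring pairs; on a parallel machine all of the
        # additions within one stage could proceed at the same time.
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])      # odd element is carried to the next stage
        values = paired
        stages += 1
    return values[0], stages

total, stages = tree_sum(range(1, 17))     # P = 16 partial sums
```

A sequential accumulation of the same 16 values would need 15 dependent additions, whereas the tree needs 4 stages.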

Since Algorithm 1 has two inner products per iteration that will
become costly as N (on vector machines) or P (on arrays) increases, a
natural goal is to devise a preconditioner that will reduce the number
of CG iterations, and hence the number of inner products, while being
inexpensive to implement. In the next section preconditioners that are
based on taking m steps of an iterative method are described. In
Section 3, the implementations of these methods on the CYBER 203/205
and the Finite Element Machine are given for a system of equations
that results from an example structural engineering problem. Results
for this problem on the CYBER 203 and the Finite Element Machine are
given in Section 4.

The research reported in this paper was supported in part by the National Aeronautics and Space
Administration under NASA Grant NAG1-46 while the author was at the University of Virginia,
Charlottesville, VA, and in part by the National Aeronautics and Space Administration Contract Nos.
NAS1-15810, NAS1-17070 and NAS1-17130 while the author was in residence at ICASE, NASA Langley Research
Center, Hampton, VA 23665.

2. m-Step Parallel Preconditioners
2.1 Choosing M

The preconditioned conjugate gradient algorithm of the last section
requires a symmetric and positive definite preconditioning matrix $M$.
The question is how to choose $M$ so that the condition number of
$\hat{K}$,

$\kappa(\hat{K}) = \max_i \lambda_i / \min_i \lambda_i$,

where the $\lambda_i$ are the eigenvalues of $\hat{K}$, is as small as
possible.

The best choice for $M$ in the sense of minimizing $\kappa(\hat{K})$
is $M = K$, but this gains nothing since $K\hat{r} = r$ is just as
difficult to solve as $Ku = f$. A class of preconditioners that
appears to be easily implemented on parallel computers arises by
choosing $M$ to be a splitting of $K$ that describes a linear
stationary iterative method. As an example, the SSOR splitting of $K$
yields

$M = \frac{1}{\omega(2-\omega)} (D - \omega L) D^{-1} (D - \omega U)$    (2.1)

where $D$, $-L$, and $-U$ are the diagonal, strictly lower, and
strictly upper parts of $K$ respectively, and $\omega$ is the
relaxation parameter. This splitting has been considered extensively
in the literature as a preconditioner; for example, refer to Concus,
Golub, and O'Leary [1976] and the references therein. Now, if the
matrix $K$ is ordered by the Multicolor ordering (Adams and Ortega
[1982]), the system $M\hat{r} = r$ can be implemented on parallel
computers as a forward followed by a backward Multicolor SOR iteration
applied to $K\hat{r} = r$ with initial guess $\hat{r}^0 = 0$, as will
be explained in more detail in Section 3. The question now arises
whether it would be beneficial to take more than one step of a linear
stationary iterative method to produce a preconditioner $M$ that more
closely approximates $K$. If $m$ steps are taken with a splitting
matrix $P$ and iteration matrix $G = P^{-1}(P - K)$, the resulting
preconditioning matrix is

$M = P(I + G + \cdots + G^{m-1})^{-1}$.    (2.2)

Now, $M$ must be symmetric and positive definite to be considered as
a preconditioner. The necessary and sufficient conditions for $M$ to
satisfy these requirements are given in Adams [1982], and we only note
here that if $P$ is the SSOR splitting matrix these conditions are
met. We also note that Dubois, Greenbaum, and Rodrigue [1979]
considered a truncated Neumann series for $K^{-1}$ as a
preconditioner, which corresponded to a Jacobi splitting where
$P = \mathrm{diag}(K)$.
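
The symmetry and positive definiteness of the m-step SSOR preconditioner can be checked numerically on a small example. The sketch below assumes the standard SSOR splitting matrix with $\omega = 1$, $P = (D - L)D^{-1}(D - U)$, and a hypothetical test matrix; it forms $M^{-1} = (I + G + \cdots + G^{m-1})P^{-1}$ explicitly, which is done here only for checking and would not be done in practice.

```python
import numpy as np

def m_step_inverse(K, P, m):
    """Return M^{-1} = (I + G + ... + G^{m-1}) P^{-1} with G = I - P^{-1} K."""
    n = K.shape[0]
    P_inv = np.linalg.inv(P)
    G = np.eye(n) - P_inv @ K
    series = sum(np.linalg.matrix_power(G, j) for j in range(m))
    return series @ P_inv

n = 8
K = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # SPD test matrix (assumed)
D = np.diag(np.diag(K))
L = -np.tril(K, -1)                                      # K = D - L - U
U = -np.triu(K, 1)
P_ssor = (D - L) @ np.linalg.inv(D) @ (D - U)            # SSOR splitting, omega = 1

M_inv = m_step_inverse(K, P_ssor, m=3)
is_symmetric = np.allclose(M_inv, M_inv.T)
min_eig = np.linalg.eigvalsh((M_inv + M_inv.T) / 2).min()
```

For this test matrix the inverse is symmetric and its smallest eigenvalue is positive, as the conditions cited above require.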

Even though the preconditioner in (2.2) for the SSOR splitting is
symmetric and positive definite, the question of how well the
resulting PCG method will reduce the number of CG iterations must be
answered. In Adams [1982], for the SSOR splitting, the condition
number of the matrix $\hat{K}$ of (1.2) was proven to decrease as the
number of steps of the preconditioner in (2.2) increases; however, the
maximum ratio of the condition number for one step to that for $m$
steps was shown to be $m$. In practice, for larger $m$, this reduction
may not be enough to balance the increase in the work that must be
done by the preconditioner (as results in Section 4 verify). However,
by parametrizing this preconditioner, the method is very effective.
This parametrization is briefly discussed in the next section and the
parameters for the SSOR splitting are given.

2.2 Parametrizing M

Johnson, Micchelli, and Paul [1982] have suggested symmetrically
scaling the matrix $K$ to have unit diagonal and then taking $m$ terms
of a parametrized Neumann series for $K^{-1} = (I-G)^{-1}$ as the
value for $M^{-1}$. This corresponds to a symmetric preconditioning
matrix whose inverse is a polynomial of degree $m-1$ in $G$,

$M_m^{-1} = \alpha_0 I + \alpha_1 G + \alpha_2 G^2 + \cdots + \alpha_{m-1} G^{m-1}$    (2.3)

derived from the Jacobi splitting,

$K = I - G$    (2.4)

of $K$; hence, the solution to $M_m\hat{r} = r$ can be implemented by
taking $m$ steps of the Jacobi iterative method applied to
$K\hat{r} = r$ with initial guess $\hat{r}^0 = 0$. Johnson et al.
choose the $\alpha_i$'s so that the eigenvalues of $M_m^{-1}K$, and
hence those of $M_m$, are positive on the interval
$[\lambda_1, \lambda_n]$ that contains the eigenvalues of $K$ and are
as close to 1 as possible in some sense, such as the min-max or the
least squares criteria. Clearly, if $m = 1$, $M_1^{-1}K = \alpha_0 K$
and the condition number of $M_1^{-1}K$ is the same for all
$\alpha_0 \neq 0$. Hence, we are only interested in $m > 1$.

We now generalize this idea for any splitting of the matrix $K$,

$K = P - Q$.    (2.5)

If $G = P^{-1}Q$, then by parametrizing (2.2), the inverse of the
m-step preconditioner becomes

$M^{-1} = (\alpha_0 I + \alpha_1 G + \alpha_2 G^2 + \cdots + \alpha_{m-1} G^{m-1}) P^{-1}$    (2.6)

and will be symmetric if $P$ is symmetric. We choose the values of
$\alpha_i$ so that the eigenvalues of $M^{-1}K$ are positive on the
interval that contains the eigenvalues of $P^{-1}K$ and are as close
to 1 as possible in some sense, such as the min-max or least squares
criteria. For the least squares criteria, the values of $\alpha$ that
correspond to the SSOR splitting are given in Table 1 for $m$ = 2, 3,
and 4.


Table 1. $\alpha$ Values for the m-step SSOR PCG Method

   m    $\alpha_0$    $\alpha_1$    $\alpha_2$    $\alpha_3$
   2       1.00          5.00
   3       1.00         -2.00          7.00
   4       1.00          7.00        -24.50         31.50
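
One way to apply the parametrized inverse (2.6) without forming powers of $G$ is a Horner-type recurrence in which step $s$ of the stationary iteration uses the right-hand side weighted by $\alpha_{m-s}$. The Python sketch below checks this on a small example. The function name, the dense solves, the test matrix, and the choice of the SSOR splitting with $\omega = 1$ are assumptions for illustration; the $\alpha$ values are those of the $m = 3$ row of Table 1.

```python
import numpy as np

def apply_parametrized(K, P, alphas, r):
    """Return (a_0 I + a_1 G + ... + a_{m-1} G^{m-1}) P^{-1} r using m solves with P."""
    m = len(alphas)
    r_hat = np.zeros_like(r, dtype=float)
    for s in range(1, m + 1):
        # One step of the stationary iteration on K r_hat = r, with the
        # right-hand side weighted by alpha_{m-s}.
        rhs = (P - K) @ r_hat + alphas[m - s] * r
        r_hat = np.linalg.solve(P, rhs)
    return r_hat

n = 6
K = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # assumed test matrix
D = np.diag(np.diag(K))
P = (D + np.tril(K, -1)) @ np.linalg.inv(D) @ (D + np.triu(K, 1))  # SSOR, omega = 1
alphas = [1.00, -2.00, 7.00]                             # alpha_0, alpha_1, alpha_2
r = np.arange(1.0, n + 1)
r_hat = apply_parametrized(K, P, alphas, r)

# Explicit evaluation of (2.6) for comparison.
G = np.eye(n) - np.linalg.solve(P, K)
poly = sum(a * np.linalg.matrix_power(G, j) for j, a in enumerate(alphas))
r_hat_direct = poly @ np.linalg.solve(P, r)
```

Setting every entry of `alphas` to one recovers the unparametrized preconditioner (2.2).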


In the next section we describe how to implement the m-step
parametrized SSOR PCG method on the CYBER 203/205 and on the Finite
Element Machine, and in Section 4 results on these machines are given.

3. Implementation of the m-Step SSOR PCG Method

We first describe the algorithm for solving $M\hat{r} = r$, where $M$
is the preconditioning matrix given by (2.6). To be concrete, this
description will be given for the following test problem.

The domain considered will be a rectangular plate discretized with
triangular finite elements over which linear basis functions are
defined. The nodes of the triangles are colored Red, Black, and Green
so that nodes on a given triangle are different colors as shown in
Figure 1. This coloring, as described in Adams and Ortega [1982],
decouples the equations so that an implementation on either vector or
array computers is possible, as will become more apparent later in
this discussion.



Figure 1. Plate (Triangular Elements)

The problem is to determine the displacements, say $u$ and $v$, in the
$x$ and $y$ directions respectively at each node in the plate whenever
the plate is loaded on one edge and constrained on another. The
partial differential equations of plane stress that govern these
displacements are well known, see Norrie and DeVries [1978], but do
not contribute to the discussion here. The important point to make is
that the stiffness matrix $K$ of (1.1) will be symmetric and positive
definite and will have dimension $2ab \times 2ab$, where $a$ is the
number of rows of nodes and $b$ is the number of columns of
unconstrained nodes (2 unknowns at each node), and each row of $K$
will contain at most 14 nonzero elements which correspond to the grid
point stencil for linear triangular elements shown in Figure 2.

Figure 2. Grid Point Stencil

Observe from Figures 1 and 2 that while there is no coupling between
the equations at two nodes of the same color, the equations at a given
node do couple. Hence, to completely decouple the system, six colors
are necessary; namely, Red(u), Red(v), Black(u), Black(v), Green(u),
and Green(v). Now, if the equations at the nodes in Figure 1 are
numbered by these six colors, from bottom to top, left to right, the
system $K\hat{r} = r$ has the form

$$
\begin{bmatrix}
D_{11}     & B_{12}     & B_{13}     & B_{14}     & B_{15}     & B_{16} \\
B_{12}^{T} & D_{22}     & B_{23}     & B_{24}     & B_{25}     & B_{26} \\
B_{13}^{T} & B_{23}^{T} & D_{33}     & B_{34}     & B_{35}     & B_{36} \\
B_{14}^{T} & B_{24}^{T} & B_{34}^{T} & D_{44}     & B_{45}     & B_{46} \\
B_{15}^{T} & B_{25}^{T} & B_{35}^{T} & B_{45}^{T} & D_{55}     & B_{56} \\
B_{16}^{T} & B_{26}^{T} & B_{36}^{T} & B_{46}^{T} & B_{56}^{T} & D_{66}
\end{bmatrix}
\begin{bmatrix}
\hat{r}_1 \\ \hat{r}_2 \\ \hat{r}_3 \\ \hat{r}_4 \\ \hat{r}_5 \\ \hat{r}_6
\end{bmatrix}
=
\begin{bmatrix}
r_1 \\ r_2 \\ r_3 \\ r_4 \\ r_5 \\ r_6
\end{bmatrix}
$$    (3.1)

where $B_{12}$, $B_{34}$, $B_{56}$, and $D_{ii}$, $i$ = 1 to 6, are
diagonal matrices.

The SSOR iteration can be realized by a forward followed by a backward
Multicolor SOR iteration (Adams and Ortega [1982]), but is only as
expensive as one Multicolor SOR iteration since a technique of Conrad
and Wallach [1979] can be used to save results in an auxiliary vector,
$y$, from the forward pass to be used in the backward pass. The
procedure is given below for solving $M\hat{r} = r$ of Algorithm 1.
The relaxation parameter $\omega$ of the SSOR method causes no
problems in the implementation and will be set to one here for
simplicity.

(1) $\hat{r} = 0$; $y = 0$

(2) For $s$ = 1 to $m$

    (1) For $c$ = 1 to 6

        (1) Form $x = -\sum_{j=1}^{c-1} B_{jc}^{T} \hat{r}_j$

        (2) Solve $D_{cc}\hat{r}_c = x + y_c + \alpha_{m-s} r_c$

        (3) Set $y_c = x$

    (2) For $c$ = 5 down to 2

        (1) Form $x = -\sum_{j=c+1}^{6} B_{cj} \hat{r}_j$

        (2) Solve $D_{cc}\hat{r}_c = x + y_c + \alpha_{m-s} r_c$

        (3) Set $y_c = x$

(3) Solve $D_{11}\hat{r}_1 = -\sum_{j=2}^{6} B_{1j} \hat{r}_j + y_1 + \alpha_0 r_1$

Algorithm 2. m-step 6-color SSOR


Notice that the values of $\alpha$ above are the parameters that were
given in Table 1, and if no parametrization is desired, these are
simply set to one. We also point out that Algorithm 2 can easily be
modified to solve problems whose domains are discretized by more
complicated finite elements or finite differences as long as a
multicolor ordering is used. For more details see Adams and Ortega
[1982]. We now turn to the implementation of Algorithm 1 in
conjunction with Algorithm 2 on the CYBER 203/205.
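
The reason a multicolor ordering helps can be seen in a small sketch: because unknowns of one color do not couple, all of them can be updated by a single vector operation, and a forward sweep followed by a backward sweep over the colors reproduces one SSOR step. The Python code below uses a two-color (red/black) ordering of a 1-D model matrix rather than the six-color plate problem, takes $\omega = 1$ and a zero initial guess, and omits both the auxiliary vector $y$ and the $\alpha$ weights of Algorithm 2; all of these are simplifying assumptions.

```python
import numpy as np

def multicolor_ssor_step(K, r, colors):
    """One SSOR step (omega = 1) on K x = r from x = 0, swept color by color."""
    d = np.diag(K)
    x = np.zeros(len(r))
    # Forward over the colors, then backward; the last color is not repeated
    # because its backward update would reproduce the value just computed.
    sweep = list(colors) + list(reversed(colors))[1:]
    for idx in sweep:
        # No coupling inside a color, so every unknown of that color is
        # updated in one vector operation.
        x[idx] = (r[idx] - K[idx] @ x + d[idx] * x[idx]) / d[idx]
    return x

n = 8
K = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # assumed model matrix
colors = [np.arange(0, n, 2), np.arange(1, n, 2)]        # red, black
r = np.arange(1.0, n + 1)
x = multicolor_ssor_step(K, r, colors)

# Reference: x = P^{-1} r with P = (D - L) D^{-1} (D - U) in the colored ordering.
perm = np.concatenate(colors)
Kc = K[np.ix_(perm, perm)]
D = np.diag(np.diag(Kc))
L = -np.tril(Kc, -1)
U = -np.triu(Kc, 1)
P = (D - L) @ np.linalg.inv(D) @ (D - U)
x_ref = np.linalg.solve(P, r[perm])
```

The vector length of each update is the number of unknowns of one color, which is what the numbering described next tries to maximize.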

3.1 CYBER 203/205 Implementation

On the CYBER 203/205, vectors consist of contiguous storage locations
and maximum efficiency of vector operations is achieved for very long
vectors. For vectors of length 1000 around 90% efficiency is obtained,
but this drops to approximately 50% or less for vectors of length 100
and 10% for vectors of length 10.
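
These percentages are roughly what a simple startup-plus-streaming timing model gives when the vector length at which half of the peak rate is reached is taken to be about 100. That model and the value 100 are assumptions introduced here for illustration; they are not stated in the paper.

```python
def vector_efficiency(length, half_length=100.0):
    """Fraction of peak rate for a vector of the given length under a
    startup-plus-streaming timing model; half_length is an assumed parameter."""
    return length / (length + half_length)

efficiencies = {n: round(vector_efficiency(n), 2) for n in (10, 100, 1000)}
```

Under this assumed model the efficiencies for lengths 10, 100, and 1000 are about 0.09, 0.50, and 0.91.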

To achieve the maximum vector length for our test problem, the $u$
equations at the Red nodes (left to right, bottom to top), including
the constrained nodes, are numbered first, followed by the
corresponding $v$ equations at the Red nodes, then by the Black $u$,
Black $v$, Green $u$, and Green $v$ equations. The numbering of the
constrained equations is necessary for ease of implementation given
the CYBER's contiguous storage requirement but also increases the
vector length from $\frac{1}{3}ab$ to $\frac{1}{3}a(b+1)$. Of course,
the actual updating of the storage locations corresponding to these
constrained nodes is prohibited by the control vector feature on this
machine, see Ortega and Voigt [1977], and for large values of $a$ and
$b$ little inefficiency is incurred. For a unit square plate, the
maximum vector length for our test problem is $a^2/3$ and is around
1000 when $a = 55$, or equivalently when the width of each triangle is
equal to 1/54.

The contiguous storage requirement coupled with the manner in which
the nodes are colored imposes a restriction on the number of nodes
that can be in each row of the plate. In particular, the last node in
the first row must be Black so that the coloring R/B/G/R/B/G, etc.
wraps around from one row to the next.

Now, the calculations of $Ku$ and $Kp$ in Algorithm 1 can be done by a
straightforward generalization of Madsen, Rodrigue, and Karush's
[1976] matrix multiplication by diagonals scheme since $K$ of (3.1)
has the structure shown in (3.2) (and will be stored by these
diagonals as well).

[Equation (3.2): diagonal structure of $K$; not recoverable from the source]

[...] technique. The subtraction in the convergence test
$\|u^{k+1} - u^k\|_\infty < \epsilon$ vectorizes and the absolute
value is performed by the vector absolute value function that is
available on the CYBER. The inner products for the calculation of
$\alpha$ and $\beta$ are done by a call to an inner product routine
which utilizes the machine's vector hardware; however, the additions
of the partial sums make this operation considerably slower than the
other vector operations required in the algorithm.
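
The idea of multiplying by diagonals can be sketched as follows: the matrix is stored as a set of diagonals, and each stored diagonal contributes one elementwise multiply-add over a long contiguous vector. The Python code below is a generic illustration of that idea on an assumed small banded matrix; the storage layout (a dictionary keyed by diagonal offset) is hypothetical and is not the layout used on the CYBER.

```python
import numpy as np

def matvec_by_diagonals(diagonals, x):
    """y = K x where diagonals maps offset k to the k-th diagonal of K
    (entries K[i, i+k]); each diagonal contributes one vector multiply-add."""
    n = len(x)
    y = np.zeros(n)
    for k, diag in diagonals.items():
        if k >= 0:
            y[:n - k] += diag * x[k:]      # rows 0 .. n-k-1
        else:
            y[-k:] += diag * x[:n + k]     # rows -k .. n-1
    return y

n = 6
K = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
     + 0.5 * np.eye(n, k=3) + 0.5 * np.eye(n, k=-3))     # assumed banded matrix
diagonals = {k: np.diag(K, k) for k in (-3, -1, 0, 1, 3)}
x = np.arange(1.0, n + 1)
y = matvec_by_diagonals(diagonals, x)
```

The number of vector operations equals the number of stored diagonals, and each has length close to the dimension of the matrix.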

Next, we turn to the implementation of Algorithm 1 in conjunction
with Algorithm 2 on the Finite Element Machine.

3.2 Finite Element Machine Implementation 

The first task for the implementation on this machine is to assign the
nodes (and hence equations at the nodes) of the plate to the
processors. This is done by assigning each processor, as nearly as
possible, an equal number of Red, Black, and Green unconstrained nodes
as illustrated in Figures 3a, 3b, and 3c, where in each figure, the
node colorings may repeat beyond the region shown. In contrast to the
CYBER implementation we need not be concerned with numbering the
constrained nodes, but instead we should require that each processor
receive an equal distribution of each color of the unconstrained
nodes.

Since memory is distributed on the Finite Element Machine, each
processor stores the portion of $u$, $p$, $r$, $\hat{r}$, and $K$ that
corresponds to its collection of nodes. For each equation that is
assigned to a processor, 14 storage locations are reserved for the
nonzero coefficients of $K$ that correspond to the grid point stencil
in Figure 2. For more information about these data structures see
Adams [1982]. In addition, storage must be reserved in each processor
for the portion of $p$ that must be received from neighbor processors
during the calculation of $Kp$ each iteration. For example, in Figure
3b, processor 1 must reserve storage for the components of $p$ that
correspond to the 3 border nodes in processor 3 and the 3 border nodes
in processor 2, but no components are received from processor 4 since
no nodes in processors 1 and 4 share a common triangle. This same
storage may be used initially for $u^0$ during the calculation of
$Ku^0$. Similarly, storage must be reserved for the $\hat{r}$
components associated with the equations at border nodes in neighbor
processors for the multiplications of $B_{jc}^{T}\hat{r}_j$ and
$B_{cj}\hat{r}_j$ in Algorithm 2.

The sending and receiving of the border $p$ components in each CG
iteration in Algorithm 1 and the border $\hat{r}$ components during
each step of the preconditioner in Algorithm 2 is only (for
rectangular regions) between neighbor processors and in particular for
our test problem will require six of the machine's eight nearest
neighbor links as shown in Figure 4 for processor P.



Figure 4. FEM Local Links 

Hence, the communication required for the m-step SSOR preconditioner
on this machine is completely local and the amount of data that a
given processor must communicate can be seen from Figure 3 to be
dependent on its number of neighbors as well as the dimension of the
rectangle of nodes assigned to it. To reduce the time required for the
I/O, the values of each color to be sent to a given neighbor can be
packaged and sent as one record and likewise for the values of a
particular color to be received from a given neighbor. If this is
done, it becomes advantageous to think of the two equations at the
same node as being the same color, because, on this machine, it does
not matter that they couple since they will always be assigned to the
same processor.

The convergence test in Algorithm 1 is 
implemented by the signal flag network. Each 
processor raises its convergence flag whenever its 
portion of u values are within the stopping 
criterion. The processors are then synchronized 


and tested to see if all flags are raised; if so, 
the iteration stops — if not, all flags are 
lowered and the iteration continues. 
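
The logic of this test can be sketched sequentially. Requiring every processor's flag to be raised is equivalent to the test $\|u^{k+1} - u^k\|_\infty < \epsilon$ of Algorithm 1, since the global maximum is below $\epsilon$ exactly when every local maximum is. The function name and the two hypothetical processors below are illustrative assumptions; the actual flag network is a hardware feature that this sketch does not model.

```python
import numpy as np

def convergence_flags(u_new_parts, u_old_parts, eps):
    """One flag per processor: raised when its own portion of u has converged."""
    return [bool(np.max(np.abs(new - old)) < eps)
            for new, old in zip(u_new_parts, u_old_parts)]

# Two hypothetical processors, each holding part of u^k and u^{k+1}.
u_old = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
u_new = [np.array([1.0, 2.0 + 1e-9]), np.array([3.0, 4.5])]
flags = convergence_flags(u_new, u_old, eps=1e-6)
stop = all(flags)        # the iteration stops only if every flag is raised
```

Here the first processor has converged but the second has not, so the iteration would continue.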

Lastly, we summarize our remarks about the Finite Element Machine
implementation of Algorithm 2 by providing a parallel version in
Algorithm 3 that will be executed by processor p. The subscript p
denotes the portion of a vector that is assigned to processor p, the
subscript n denotes the portion of the vector that is received from
all of processor p's neighbors, and the subscript t denotes the total
vector which consists of the components received by, as well as those
assigned to, processor p.


(1) $\hat{r}_t = 0$; $y_p = 0$

(2) For $s$ = 1 to $m$

    (1) For $c$ = 1 to 6

        (1) $x = -\sum_{j=1}^{c-1} B_{jc}^{T} \hat{r}_{j,t}$

        (2) $D_{c,p}\hat{r}_{c,p} = x + y_p + \alpha_{m-s} r_p$

        (3) $y_p = x$

        (4) If $c$ mod 2 = 0 then

            (1) Send border portion of $\hat{r}_{c-1,p}$ and $\hat{r}_{c,p}$

            (2) Receive $\hat{r}_{c-1,n}$ and $\hat{r}_{c,n}$

    (2) For $c$ = 5 down to 2

        (1) $x = -\sum_{j=c+1}^{6} B_{cj} \hat{r}_{j,t}$

        (2) $D_{c,p}\hat{r}_{c,p} = x + y_p + \alpha_{m-s} r_p$

        (3) $y_p = x$

        (4) If $c$ mod 2 = 0 then

            (1) Send border portion of $\hat{r}_{c+1,p}$ and $\hat{r}_{c,p}$

            (2) Receive $\hat{r}_{c+1,n}$ and $\hat{r}_{c,n}$

(3) Solve $D_{1,p}\hat{r}_{1,p} = -\sum_{j=2}^{6} B_{1j} \hat{r}_{j,t} + y_p + \alpha_0 r_p$

Algorithm 3. FEM m-step 6-color SSOR


4. Results 

The example plane stress problem was run on the CYBER 203 at the NASA
Langley Research Center for a unit square plate for varying mesh
sizes. Table 2 gives the number of iterations, I, and time, T, in
seconds to solve this problem using $m$ = 0-10. The parametrized
preconditioner results are denoted by P, the number of rows in the
plate by $a$, and the maximum vector length by $v$.


Table 2. CYBER 203 Iterations and Timings, m-step SSOR PCG

       v = 22        v = 41        v = 132       v = 561       v = 1282      v = 2134
       a = 8         a = 11        a = 20        a = 41        a = 62        a = 80
   m     I       T     I       T     I       T     I       T     I       T     I       T
   0   112    .133   157    .213   271    .565   536   3.293   788  11.845   929  22.780
   1    52    .129    66    .184   111    .454   214   2.373   311   7.832   395  17.194
   2    38    .143    50    .208    7?    .478   152   2.428   221   7.773   280  17.380
  2P    31    .116    40    .167    61    .369   118   1.885   172   6.052   218  13.534
   3    31    .155    39    .216    65    .520   124   2.585   181   8.174   229  18.469
  3P    24    .121    30    .167    46    .369    88   1.836   129   5.828   163  13.151
  4P    22    .138    24    .166    35    .350    67   1.726    99   5.471   124  12.306
  5P    1?    .143    20    .167    29    .347    56   1.716    82   5.345   104  12.260
  6P    18    .159    18    .175    25    .348    47   1.670    70   5.263    88  12.011
  7P                                26    .413    43   1.739    64   5.451    80  12.410
  8P                                21    .375    36   1.634    54   5.139    69  11.985
  9P                                              33   1.660    48   5.056    61  11.731
 10P                                              31   1.709    44   5.070    55  11.594


It should be noted that the inner product routine that was used for
these results was developed at Langley and is optimized for the CYBER
203. Several observations can be made from these six test cases.

(1) The parametrized preconditioner is better with respect to both
    the number of iterations and the execution time than the
    corresponding unparametrized preconditioner.

(2) The optimal number of steps of the parametrized preconditioner
    increased as the vector length increased.


In relation to (2), an interesting question is to determine how many
steps would be beneficial for a large problem. The answer to this is
quite simple if the number of iterations, $N_m$, could be expressed as
a function of $m$, since the execution time of the m-step method can
be expressed as

$T(m) = N_m (A + mB)$    (4.1)

where $A$ is the time for one outer conjugate gradient iteration and
$B$ is the time for 1 step of the preconditioner. Now if we assume
that $N_{m+1} < N_m$, taking $m+1$ steps is more beneficial than
taking $m$ steps whenever

(1) $(m+1)N_{m+1} - mN_m < 0$ (this means that the total number of
    inner loops is less for $m+1$ steps)

or (2) $B/A < \frac{N_m - N_{m+1}}{(m+1)N_{m+1} - mN_m}$.    (4.2)

The inequalities in (4.2) explain for larger problems when more steps
of the preconditioner should be taken. For instance, the values of the
left and right side of inequality (2) when $m = 9$ are (.81, .15),
(.68, .5), and (.76, 6) for $a$ = 41, 62, and 80 respectively. Hence,
ten steps are preferable to nine only for $a = 80$.
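
The arithmetic behind this example can be reproduced from (4.1). Requiring $T(m+1) < T(m)$ gives $B[(m+1)N_{m+1} - mN_m] < A(N_m - N_{m+1})$, which is inequality (1) or (2) depending on the sign of the bracket. The short Python sketch below evaluates the right side of (2) using the iteration counts of the 9P and 10P rows of Table 2 and the $B/A$ values quoted above; the function name is hypothetical.

```python
def threshold(m, n_m, n_next):
    """Right side of inequality (2) in (4.2). Returns None when
    (m+1) N_{m+1} - m N_m <= 0, i.e. m+1 steps win for any B/A > 0."""
    denom = (m + 1) * n_next - m * n_m
    if denom <= 0:
        return None
    return (n_m - n_next) / denom

# a: (N_9, N_10, B/A), iteration counts from the 9P and 10P rows of Table 2
cases = {41: (33, 31, 0.81), 62: (48, 44, 0.68), 80: (61, 55, 0.76)}
rhs = {a: threshold(9, n9, n10) for a, (n9, n10, _) in cases.items()}
prefer_ten = {a: rhs[a] is None or cases[a][2] < rhs[a] for a in cases}
```

The computed right sides are 2/13 (about .15), .5, and 6, so the inequality holds only for $a = 80$.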

We now give the Finite Element Machine results. The example plane
stress problem with 6 rows and 6 columns of nodes (60 equations) was
solved on a 1-, 2-, and then on a 5-processor Finite Element Machine
using the m-step SSOR PCG method. For this problem the assignment of
unconstrained nodes to the processors is shown in Figure 5.



Figure 5. FEM Processor Assignments 


Observe from Figure 5 that for the two and five processor assignments
each processor has an equal number of R, B, and G nodes as well as an
equal number of border nodes to be communicated. Therefore, in the
absence of communication time and any differences in processor speeds,
a speedup of two (five) over the one processor case should be
realized.


The number of iterations and the time in seconds for the above
assignments are given in Table 3. The speedups for the two and five
processor assignments also are included.


Table 3. FEM Iterations, Timings, Speedups, m-step SSOR PCG

           P = 1             P = 2                    P = 5
   m     I       T        I       T   Speedup      I       T   Speedup
   0    48   63.35       48   33.01     1.92      48   17.70     3.58
   1    19   47.90       19   25.85     1.85      19   14.85     3.23
   2    13   48.75       13   26.65     1.83      13   15.50     3.15
  2P    11   41.95       11   22.95     1.83      11   13.30     3.15
   3    11   54.95       11   30.15     1.82      11   17.65     3.11
  3P     8   41.25        8   22.75     1.81       8   13.25     3.11
   4    10   62.40       10   34.30     1.82      10   20.20     3.09
  4P     6   39.80        6   22.00     1.81       6   12.90     3.09
  5P     5   40.60        5   22.50     1.80       5   13.25     3.06
  6P     5   47.05        5   26.20     1.80
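
The speedup entries are the one-processor time divided by the multiprocessor time. As a small check, the Python sketch below recomputes them for three rows of Table 3 (reading the five-processor time for $m = 0$ as 17.70 seconds) and also reports the parallel efficiency, speedup divided by the number of processors, which is a derived quantity not given in the table.

```python
# m: (T for P = 1, T for P = 2, T for P = 5) in seconds, taken from Table 3
times = {
    "0": (63.35, 33.01, 17.70),
    "1": (47.90, 25.85, 14.85),
    "4P": (39.80, 22.00, 12.90),
}

speedups = {m: (round(t1 / t2, 2), round(t1 / t5, 2))
            for m, (t1, t2, t5) in times.items()}
# Parallel efficiency = speedup / number of processors (derived, not in Table 3).
efficiency = {m: (round(s2 / 2, 2), round(s5 / 5, 2))
              for m, (s2, s5) in speedups.items()}
```

The recomputed speedups agree with the tabulated values for these rows.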





Several observations can be made from Table 3. 

(1) The effectiveness of the preconditioner as a function of $m$ was
    the same for the sequential and two and five processor cases
    (4P, 5P, 3P, 2P, 1, 2, 3, 4).

(2) Taking more than one step of the unparametrized preconditioner
    was not advantageous.

(3) The overhead for the CG ($m = 0$) algorithm was less than that
    for the PCG algorithm because for two and five processors the
    communications for the preconditioner rather than for the inner
    products dominate the overhead.

In regard to (3), if we keep the number of nodes per processor fixed
and continue to add processors up to a certain number, say $n_a$, the
overhead for the preconditioner will still be more than that for the
CG method and hence $m$ = 3P or 2P may become optimal; however, as the
number of processors increases beyond $n_a$, the value of $B/A$ in
(4.2) will continue to decrease until $m$ > 4P steps of the
preconditioner will be optimal. The behavior of the m-step PCG
algorithm can be modelled as a function of the number of processors,
the problem size, and the relative speed of arithmetic to
communication times for the machine. For more details, see Adams
[1982].

5. Summary and Conclusions 
The m-step multicolor SSOR preconditioned conjugate gradient method
described herein has been shown to be effective on vector computers
and for a small problem was effective on the Finite Element Machine.
As more processors and the sum/max hardware circuit become available
on this machine, the method will be tested on larger problems. This
method does not face the usual difficulty in choosing the optimal
relaxation parameter, $\omega$, for the multicolor SSOR method, since
for this ordering and few colors $\omega = 1$ is a good choice, see
Adams [1983]. A problem still remains in applying the method to
irregular regions since the grid must be colored and for array
machines must also be distributed to the processors in light of this
coloring.